Abstract 

Background: Network Tools and Applications in Biology (NETTAB) Workshops are a series of meetings focused on 
the most promising and innovative ICT tools and on their usefulness in Bioinformatics. The NETTAB 2011 workshop, 
held in Pavia, Italy, in October 2011, was aimed at presenting some of the most relevant methods, tools and 
infrastructures that are nowadays available for Clinical Bioinformatics (CBI), the research field that deals with clinical 
applications of bioinformatics. 

Methods: In this editorial, we report the viewpoints and opinions of three world CBI leaders, who were invited to 
participate in a panel discussion of the NETTAB workshop on the next challenges and future opportunities of this 
field. These include the development of data warehouses and ICT infrastructures for data sharing, the 
definition of standards for sharing phenotypic data and the development of novel tools for efficient 
search computing solutions. 

Results: Some of the most important design features of a CBI-ICT infrastructure are presented, including data 
warehousing, modularity and flexibility, open-source development, semantic interoperability, integrated search and 
retrieval of -omics information. 

Conclusions: Clinical Bioinformatics goals are ambitious. Many factors, including the availability of high-throughput 
"-omics" technologies and equipment, the widespread availability of clinical data warehouses and the noteworthy 
increase in data storage and computational power of the most recent ICT systems, justify research and efforts in 
this domain, which promises to be a crucial leveraging factor for biomedical research. 



Background 

Clinical Bioinformatics (CBI) can be defined as "the clini- 
cal application of bioinformatics-associated sciences and 
technologies to understand molecular mechanisms and 
potential therapies for human diseases" [1]. Being specifi- 
cally focused on clinical context, CBI is characterized by 
the challenge of integrating molecular and clinical data to 
accelerate the translation of knowledge discovery into 
effective treatment and personalized medicine. CBI shares 
methods and goals with Translational Bioinformatics 
(TBI), which has been defined as the "development of storage, 
analytic, and interpretative methods to optimize the 
transformation of increasingly voluminous biomedical 
data - genomic data in particular - into proactive, predictive, 
preventive, and participatory health management" [2]. 
CBI and TBI can thus be considered almost synonymous 
terms, both being related to the same set of 
scientific questions. In this paper we will refer to CBI, 
to stress the clinical decision-making aspects of 
bioinformatics, although we note that the two terms are 
used interchangeably in current practice. 

More specifically, CBI is aimed at providing methods 
and tools to support two different decision-makers. On 
the one hand, it should assist clinicians in dealing with 
clinical genomics (biomarker discovery), genomic medi- 
cine (identification of genotype/phenotype correlations), 
pharmacogenomics and genetic epidemiology at the 
point of care (see [3] for a detailed discussion); on the 
other hand, it must support researchers in the proper 
reuse of clinical data for research purposes [4]. For this 
reason, together with bioinformatics problems, related to 
the management, analysis and integration of "-omics" 
data, CBI needs to deal with the proper definition of clin- 
ical decision-support strategies, an area deeply studied in 
the context of medical informatics and artificial intelligence 
in medicine. CBI is therefore at the confluence of 
different disciplines, and may foster the definition of a 
comprehensive framework to handle and manage all kinds 
of biomedical data, supporting their transformation into 
information and knowledge. 

Even if the main aim of CBI is very ambitious, there is a 
variety of enabling factors that strongly support research 
in this direction. First of all, in the last few years new 
genome sequencing and other high-throughput experi- 
mental techniques have generated vast amounts of mole- 
cular data, which, when coupled with clinical data, may 
lead to major biomedical discoveries, if properly 
exploited by researchers. 

Second, new diagnostic and prognostic tests based on 
molecular biomarkers are increasingly available to clini- 
cians, thus consistently refining the capability of dissect- 
ing diseases and, at the same time, enlarging the decision 
space on the basis of the improved assessment of risk. 

Third, the increasing online availability of the "bib- 
liome", i.e., the biomedical text corpus, made through 
published manuscripts, abstracts, textual comments and 
reports, as well as direct-to-Web publications, has stimu- 
lated the development of new algorithms able to semi- 
automatically extract knowledge from these texts so as to 
make it available in computable formats. Such algorithms 
have been proved to be able to effectively combine the 
information reported in the text with that contained in 
biological knowledge repositories and are increasingly 
used for hypothesis generation, or corroboration of clinical 
findings. Their use in the clinic poses challenges, but 
they may become an important tool to support 
decision-making. 

Finally, the steady growth of publicly available data 
and knowledge sources and the possibility of easily accessing 
low-cost, high-throughput molecular technologies have 
meant that computational technologies and bioinformatics 
are increasingly central in genomic medicine; 
cloud computing is being recognised as a key 
technology for the future of genomic research, as it can facilitate 
large-scale translational research. 

Network Tools and Applications in Biology (NETTAB) 
Workshops are a series of meetings focused on the most 
promising and innovative ICT tools and on their usefulness 
in Bioinformatics [5]. They aim at introducing participants 
to the most promising among evolving network 
standards and technologies that are being applied to the 
biomedical application domain. Each year, they are 
focused on a different technology or domain for which 
talks on basic technologies, tools, and platforms of interest, 
as well as real applications, are presented. The NETTAB 
2011 workshop, held in Pavia, Italy, in October 
2011, was aimed at presenting some of the most relevant 
methods, tools and infrastructures that are nowadays 
available for CBI. 

In this paper, we report the viewpoints and opinions of three 
world CBI leaders, who were invited to participate 
in a panel discussion of the NETTAB workshop on the 
next challenges and future opportunities of this field. 

Looking at CBI from the technological side, these 
experts have identified three areas that need advancement 
and further research. These include the development of 
data warehouses and ICT infrastructures for data sharing, 
the definition of standards for sharing phenotypic data 
and the development of novel tools for efficient 
search computing solutions. In the remainder of this 
editorial we report these opinions and discuss their 
relevance to the field. 

ICT infrastructures for supporting clinical bioinformatics: 
important design features of the i2b2 system 

i2b2 (Informatics for Integrating Biology and the Bedside) 
is an NIH-funded National Center for Biomedical 
Computing based at Partners HealthCare System. It provides 
an integrated framework for using clinical data for 
research [4]. 

The back end of i2b2 has a modular software design, 
called the 'Hive,' that manages everything having to do 
with how data is stored and accessed. The front end of 
i2b2 is the i2b2 Web client, a user interface that allows 
researchers to query and analyze the underlying data. The 
software is open source and can be extended by users 
once the core cells of the Hive are included and correctly 
configured. 

To date, i2b2 has been deployed at over 70 sites 
around the world, where it is being used for cohort 
identification, hypothesis generation and retrospective 
data analysis. At many of these sites, additional func- 
tionality is being developed to suit the needs of the 
researchers. 

Several aspects of i2b2 contribute to its rapid adoption 
by the clinical research community. The first is that it is 
open source and therefore not only is it free to try, but 
there is a built-in set of collaborators - other users - with 
whom to engage both to get help with any questions and 
to foster innovation. The open source, self-service nature 
of i2b2 allows investigators to try out ideas stepwise at 
their own pace and at no financial cost. The online 
documentation and community wiki are kept up-to-date and 
greatly assist in user support. Secondly, both the fact that 
it is open source and the modularity of the design enforce 
backward compatibility with existing research, so that 
existing research is added to the i2b2 platform and does not 
become obsolete. 

But perhaps the key to the utility of i2b2 is the simpli- 
city of its database design. A research data warehouse 
typically includes data from disparate sources, such as 
electronic health records, administrative systems, genetic 
and research data, and lab results, to name a few. The 
structure of the i2b2 database allows this data to be 
aggregated and optimized for rapid cross-patient search- 
ing in a way that is transparent to the user. The specific 
design and flexibility of the data model supports new 
research data being added to the database as it is 
amassed, while allowing users to construct complex 
queries against the multiple source systems. 

i2b2 data is stored in a star schema, first described by 
Kimball [6]. A very large central fact table (observation_fact) 
is surrounded by and connected to the smaller 
dimension tables, i.e., the patient, observer, visit, concept 
and modifier dimensions (Figure 1). A fact is defined as 
an observation on a patient, made at a specific time, by a 
specific observer, during a specific event. Dimension 
tables hold descriptive information and attributes about 
the facts. 
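
As an illustration only, the star schema described above can be sketched with an in-memory SQLite database. Only the table name observation_fact comes from the text; the dimension table names, all column names and the sample rows below are simplifying assumptions and do not reproduce the actual i2b2 schema.

```python
import sqlite3

# Hypothetical, heavily simplified star schema: one central fact table
# surrounded by small dimension tables (column names are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient_dimension (patient_num INTEGER PRIMARY KEY, sex TEXT);
CREATE TABLE concept_dimension (concept_cd TEXT PRIMARY KEY, name TEXT);
CREATE TABLE visit_dimension   (encounter_num INTEGER PRIMARY KEY,
                                patient_num INTEGER);
CREATE TABLE observation_fact (
    patient_num   INTEGER REFERENCES patient_dimension,
    encounter_num INTEGER REFERENCES visit_dimension,
    concept_cd    TEXT    REFERENCES concept_dimension,
    observer_id   TEXT,
    start_date    TEXT
);
""")

# Made-up sample data: three patients, two concepts, one fact per patient.
conn.executemany("INSERT INTO patient_dimension VALUES (?, ?)",
                 [(1, "F"), (2, "M"), (3, "F")])
conn.executemany("INSERT INTO concept_dimension VALUES (?, ?)",
                 [("DX:APPEND", "Appendicitis"),
                  ("DX:HERNIA", "Hernia of abdominal cavity")])
conn.executemany("INSERT INTO visit_dimension VALUES (?, ?)",
                 [(10, 1), (11, 2), (12, 3)])
conn.executemany("INSERT INTO observation_fact VALUES (?, ?, ?, ?, ?)", [
    (1, 10, "DX:APPEND", "dr_a", "2011-01-05"),
    (2, 11, "DX:APPEND", "dr_b", "2011-02-11"),
    (3, 12, "DX:HERNIA", "dr_a", "2011-03-20"),
])


def count_patients(concept_name, sex):
    """Count distinct patients with a given concept and sex.

    The query joins the fact table to two dimension tables, which is
    the typical access pattern of a star schema.
    """
    row = conn.execute("""
        SELECT COUNT(DISTINCT f.patient_num)
        FROM observation_fact f
        JOIN patient_dimension p ON p.patient_num = f.patient_num
        JOIN concept_dimension c ON c.concept_cd = f.concept_cd
        WHERE c.name = ? AND p.sex = ?
    """, (concept_name, sex)).fetchone()
    return row[0]
```

In this sketch every fact is one row that points to its describing dimensions, so a cross-patient question becomes a short join from the fact table outwards.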

The star schema is optimized for analytic querying and 
reporting. Its design tends to mirror the way users think 
about and use data, which is important since users must 
understand what data is available in order to formulate 
queries. The straightforward connections between the 
fact and dimension tables mean that navigation through 
the database via joins and drilling into or rolling up 
dimensional data is simple and quick. The design allows 
the fact table to grow to billions of rows while maintaining 
performance. Another advantage of the fact table 
design is that it is well suited to handling "sparse" data, 
i.e., data that has many possible attributes (such as all 
possible medical concepts), of which only a few are 
applicable. In this model, only positive facts are recorded, 
thus resulting in more efficient storage. 
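
The storage argument for sparse data can be made concrete with a toy comparison. The patients, concept codes and observations below are invented for illustration; the point is only that a fact list grows with the number of positive observations, whereas a dense patient-by-concept layout grows with the product of patients and concepts.

```python
# Hypothetical toy data: 4 patients and 6 possible concepts.
concepts = ["DX:APPEND", "DX:HERNIA", "LAB:GLUCOSE",
            "MED:ASPIRIN", "DX:ASTHMA", "PROC:CHOLE"]
patients = [1, 2, 3, 4]

# Sparse "fact" representation: one row per positive observation only.
facts = [(1, "DX:APPEND"), (1, "LAB:GLUCOSE"),
         (2, "DX:HERNIA"), (4, "MED:ASPIRIN")]

# Dense alternative: one cell per (patient, concept) pair, mostly False.
observed = set(facts)
dense = {(p, c): (p, c) in observed for p in patients for c in concepts}

dense_cells = len(dense)   # grows as patients x concepts
fact_rows = len(facts)     # grows only with recorded observations
```

With hundreds of thousands of concepts, as mentioned later in the text, the gap between the two layouts becomes correspondingly larger.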

Perhaps the most powerful aspect of the i2b2 database 
design is the design of the metadata. In i2b2, metadata is 
the vocabulary, all the medical terms that describe the 
facts in the database. Metadata is what allows users to 
interact with the database. A typical clinical data warehouse 
may have 100,000 to 500,000 concepts, including 
ICD-9 [7], SNOMED-CT [8], CPT [9], HCPCS [10], 
NDC [11] and LOINC [12] codes, as well as a host of 
local codes from in-house systems. Without an intuitive 
and easy-to-use structure, users would be stymied in 
understanding and using the codes. In i2b2, a hierarchical 
folder system is used to group the concepts. General 
terms are located in higher level folders, with more specific 
but related terms in folders and leaves underneath. 
The way the metadata looks in the i2b2 Web client 
directly reflects its structure in the table (Figure 2). A 
user can drill up and down in the folders in the user 
interface to clearly see the hierarchy and find terms of 
interest. 

Figure 1 The i2b2 star schema. 
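
A hierarchical folder system of this kind can be sketched by storing each term with its full path, so that drilling down becomes a prefix match. The path format, the tuple layout and the helper names below are assumptions made for illustration and are not taken from the i2b2 metadata tables; the term names are borrowed from the Web client screenshot.

```python
# Hypothetical metadata rows: (full path, display name, is_leaf).
METADATA = [
    ("/Demographics/", "Demographics", False),
    ("/Demographics/Gender/", "Gender", False),
    ("/Demographics/Gender/Female/", "Female", True),
    ("/Demographics/Gender/Male/", "Male", True),
    ("/Diagnoses/", "Diagnoses", False),
    ("/Diagnoses/Digestive system/", "Digestive system", False),
    ("/Diagnoses/Digestive system/Appendicitis/", "Appendicitis", True),
    ("/Diagnoses/Digestive system/Diseases of esophagus/",
     "Diseases of esophagus", True),
]


def children(folder):
    """Return the names of the entries exactly one level below a folder."""
    depth = folder.count("/")
    return [name for path, name, _ in METADATA
            if path.startswith(folder) and path != folder
            and path.count("/") == depth + 1]


def leaves_under(folder):
    """Return the names of all leaf terms anywhere below a folder."""
    return [name for path, name, is_leaf in METADATA
            if is_leaf and path.startswith(folder)]
```

Under this layout, adding a new code means adding one row under an existing path, and adding an entire coding system means adding a new top-level path.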

Maintaining and updating the metadata is a significant, 
but workable challenge. New medical codes are 
constantly being created, and old codes are discarded or 
changed. The structure of the metadata must be able to 
seamlessly absorb new codes while remaining backward 
compatible with old coding schemes. The hierarchical 
classification scheme of i2b2 makes it easy to map new 
codes to existing folders and to create new folders as 
needed. Entire new coding systems can be added just by 
creating a new folder. Discarded codes can remain in 
the hierarchy next to newer ones and be used to reference 
older data, or be hidden to discourage their usage in new 
queries. 

One goal of i2b2 is to help integrate data from the 
many different sources that exist in modern day health- 
care institutions in order to present a comprehensive 
view of patient care for research. The simple and intui- 
tive design of the i2b2 database enables users to con- 
struct complex queries over these disparate data sources. 
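
The query logic of the Web client, as described in the caption of Figure 2, combines terms placed in different groups with a logical AND. A minimal sketch of that logic with Python sets follows. The patient identifiers are invented, and the choice to combine several terms inside one group with a logical OR is an assumption of this sketch that the text does not state.

```python
# Hypothetical index from term to the set of patients having that term.
PATIENTS_BY_TERM = {
    "Appendicitis": {1, 2, 5},
    "Hernia of abdominal cavity": {3},
    "Female": {1, 3, 4},
    "Male": {2, 5},
}


def run_query(groups):
    """Return the patients matching every group.

    Each group is a list of terms. Terms in the same group are unioned
    (an assumption of this sketch); different groups are intersected,
    which is the AND logic described for the Web client.
    """
    result = None
    for group in groups:
        in_group = set()
        for term in group:
            in_group |= PATIENTS_BY_TERM.get(term, set())
        result = in_group if result is None else result & in_group
    return result if result is not None else set()


# Group 1 holds "Appendicitis", Group 2 holds "Female".
patient_count = len(run_query([["Appendicitis"], ["Female"]]))
```

The returned count plays the role of the patient_count value that the Web client reports once a query has finished.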

Using the new generation of Healthcare and Life Sciences 
standards for Personalized Medicine 

The success of Personalized Medicine (PM) at the point 
of care is dependent on the effective use of PM knowledge 
(e.g., pharmacogenomic interpretation of somatic 
mutations in tumor tissues) while considering the 
patient's complete medical history (e.g., other diseases, 
medications, allergies, and genetic mutations). 

Figure 2 The i2b2 Web Client is shown. The characteristic terms (and their respective modifiers) that describe the patients in the Clinical 
Research Chart are shown in the tree structure on the left. The query is composed in the upper right with the logic of a "Venn-Diagram". Terms 
in two different Groups will be logically ANDed together, and the number of patients will be shown after computation, in this case the number of 
patients who are both female and have had appendicitis. 

In order for PM knowledge to be effectively applied to 
the patient medical records, representations of data and 
knowledge need to be standardized due to the heteroge- 
neity of their original formats. Both data and knowledge 
are generated nowadays by a variety of sources, each of 
them using proprietary formats and idiosyncratic seman- 
tics, often not represented explicitly (for example, when 
contextual data is unstructured and thus cannot be 
parsed by decision support applications). 

Interpretation of clinical data typically starts at parsing 
the metadata, e.g., the predefined schemas of clinical 
information systems. However, these schemas (most 
often relational) cannot accommodate the complexity of 
contextual data representation. Thus, it is important to 
have a richer language allowing the explicit representa- 
tion of patient-specific context of each discrete data item 
and of how it relates to other data items, as well as how 
it fits within the entire health history of an individual. 

Dispersed and disparate medical records of a patient 
are often inconsistent and incoherent. A patient-centric, 
longitudinal electronic health record (EHR) based on 
international standards (e.g., CEN EHR 13606 [13]) could 
provide a coherent and explicit representation of the 
data's semantics. New PM evidence, generated by clinical 
research and validated in clinical trials and by data 
mining, should be represented in alignment with clinical 
data representations in a way that lends itself to PM 
realization. A constantly growing stream of raw data is 
available today in both research and clinical environments, 
e.g., DNA sequences and expression data along with rare 
variants and their presumed affected function, as well as 
sensor data along with deduced personal alerts. 

The representation of such raw data should adhere, as 
much as possible, to common and agreed-upon reference 
models (e.g., HL7/ISO RIM - Reference Information 
Model [14] or the openEHR RM - Reference Model [13]) 
that provide unified representations of the common con- 
structs needed for health information representation. For 
example, any observation could be represented in the 
same way in terms of its attributes, such as id, timing, 
code, value, method and status, but more importantly, 
using the same reference models could lead to the stan- 
dard representation of clinical statements (e.g., "observa- 
tion of gall bladder acute inflammation indicated having 
a procedure of cholecystectomy", or "EGFR variations 
cause resistance to Gefitinib"), where implicit semantics 
can become explicit and thus processable by decision 
support applications. 
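
To make the idea of a uniform representation more tangible, the sketch below models an entry with the attributes listed above (id, timing, code, value, method and status) and a clinical statement as an explicit relationship between two entries. The class names, the extra kind field and the sample values are assumptions made for this illustration; they are not part of the HL7 RIM or of the openEHR Reference Model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """One discrete data item, always described by the same attributes."""
    kind: str                      # e.g. "observation" or "procedure"
    id: str
    timing: str
    code: str
    value: Optional[str] = None
    method: Optional[str] = None
    status: str = "completed"


@dataclass(frozen=True)
class ClinicalStatement:
    """An explicit, machine-readable link between two entries."""
    source: Entry
    relationship: str
    target: Entry

    def render(self) -> str:
        return (f"{self.source.kind} of {self.source.code} "
                f"{self.relationship} "
                f"{self.target.kind} of {self.target.code}")


# Hypothetical sample values mirroring the cholecystectomy example.
inflammation = Entry(kind="observation", id="obs-1", timing="2011-10-12",
                     code="gall bladder acute inflammation",
                     value="present", method="ultrasound")
surgery = Entry(kind="procedure", id="proc-1", timing="2011-10-13",
                code="cholecystectomy")
statement = ClinicalStatement(inflammation, "indicated", surgery)
```

Because the relationship is stored as data rather than implied by free text, a decision support application can inspect it directly.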

The abovementioned reference models can underlie 
the logical models of health data warehousing. Such 
warehousing could maintain the richest semantic 
representation of data and knowledge in a way that is 
also interoperable with other information systems. 
Performing specific tasks, such as summarizing patient data 
or analyzing cohort data in research studies, needs more 
optimized representations of the data and knowledge 
persisted in warehouses. Data marts are such optimized 
representations, and multiple data marts could be 
derived from a single warehouse. For example, the star 
schema underlying the i2b2 framework (see Figure 1) 
could be seen as a generic data mart for translational 
research that could be based on data exported from a 
standardized data warehouse maintained by a single 
health organization or across organizations, such as in 
the case of clinical affinity domains or integrated 
delivery networks. 
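
The relationship between a warehouse and its data marts can be sketched as two task-specific views derived from the same records. The record layout, the codes and the two function names below are hypothetical and only illustrate that one source can feed several optimized representations.

```python
from collections import defaultdict

# Hypothetical warehouse records in one standardized layout.
WAREHOUSE = [
    {"patient": 1, "code": "DX:APPEND", "time": "2011-01-05",
     "status": "completed"},
    {"patient": 1, "code": "PROC:APPENDECTOMY", "time": "2011-01-06",
     "status": "completed"},
    {"patient": 2, "code": "DX:APPEND", "time": "2011-02-11",
     "status": "cancelled"},
    {"patient": 3, "code": "DX:HERNIA", "time": "2011-03-20",
     "status": "completed"},
]


def cohort_mart(records):
    """Mart for cohort analysis: distinct (patient, code) facts,
    keeping only completed records."""
    return sorted({(r["patient"], r["code"])
                   for r in records if r["status"] == "completed"})


def summary_mart(records):
    """Mart for patient summaries: a time-ordered list per patient,
    keeping the status so that nothing is hidden from the clinician."""
    timeline = defaultdict(list)
    for r in sorted(records, key=lambda r: r["time"]):
        timeline[r["patient"]].append((r["time"], r["code"], r["status"]))
    return dict(timeline)
```

The two marts deliberately drop or keep different details, which is the sense in which each is optimized for its own task.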

In many cross-enterprise warehousing efforts, the 
main format used to convey patient data is the Clinical 
Document Architecture (CDA) standard [15]. CDA 
documents strike a balance between physicians' narra- 
tives and structured data in order to facilitate the gra- 
dual transition from unstructured clinical notes to 
standardized and structured data. The same transforma- 
tion should also take place in knowledge representa- 
tions, from scientific papers in natural language to 
structured knowledge, for example. 

The efforts to apply Natural Language Processing 
(NLP) to health information could be connected to 
healthcare information technologies through standards 
such as CDA, which uses the clinical statement concept. The 
NLP fundamentals can be reduced to the clinical statement 
constituents, and CDA can thus be a good 
"catcher" of the results of NLP run over unstructured 
health information. 

Search and extraction of relevant information from large 
amounts of data 

The continuously increasing amount of available data 
poses significant technological and computational challenges, 
both to its management (collection, storage, integration, 
preservation) and to its effective use (access, sharing, 
search, extraction, analysis). This issue is becoming predominant 
in several fields and it is being addressed in different 
ways, according to the peculiarities of each field. 

The Web is a paradigmatic field for this aspect. A 
rapidly growing mass of data is flooding the Web. Yet, 
by leveraging the typical linked nature of Web data, 
technological and computational advancements are keeping 
users (at least for now) from drowning in Web data. Automatic 
robots have been implemented to crawl Web 
resources, collect their key data and store them in 
powerful database management systems. Effective indexing 
and ranking techniques, such as the Google PageRank 
[16], have been implemented to efficiently catalogue and 
sort Web resources according to their key data and likely 
relevance. This enables Web search engines to provide 
lists of items which often include among their top 10 or 
20 items the one(s) that can reasonably answer numerous, 
yet simple, user search questions. 
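
As a rough illustration of link-based ranking, the sketch below computes PageRank-style scores by power iteration on a tiny, invented link graph. It is a textbook simplification under stated assumptions (a damping factor of 0.85, a fixed number of iterations, every page listed as a key) and not a description of how any production search engine works.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration sketch of PageRank.

    `links` maps each page to the list of pages it links to; every page
    must appear as a key. Pages without outgoing links spread their
    rank evenly over all pages.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical three-page graph: "a" and "b" link to "c", "c" links to "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
best_page = max(ranks, key=ranks.get)
```

In this toy graph the page with the most incoming links ends up with the highest score, which is the intuition behind using link structure as a proxy for likely relevance.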

Such ability, which is tremendously boosting the Web 
as an extraordinarily easy-to-use source of information, is 
based on the assumption that user searches are mainly 
aimed at finding "at least one" or "the most evident" item 
that can answer the user's question. Current Web search 
technologies are insufficient when search questions 
become more complex, simultaneously involve different 
topics, or require the retrieval of most of (if not 
all) available items regarding the question, possibly 
ordered according to different user-defined features. 
Furthermore, only a limited part of all data 
accessible through the Web is estimated to be found by current 
search engines: the vast "deep Web", including 
dynamic pages returned in response to a query or 
accessed through a form, resources protected by password, 
sites limiting access by using various security technologies 
(e.g., CAPTCHAs), and pages that are accessible 
only through links produced by scripts, remains hidden. 

Especially in the CBI field, the amount of collected data 
is continuously and rapidly increasing, in particular with 
the recent collection of -omics data. Moreover, compared to 
the Web, the current ability to extract relevant biomedical 
information and to answer even common CBI 
questions is far more limited, for several reasons. 

First, the biomedical-molecular data - which are of var- 
ious types - are stored in several different formats within 
systems that are distributed, heterogeneous, and often 
not interoperable. Furthermore, a lot of important infor- 
mation is subjectively described in free texts, within chief 
complaints, discharge letters, clinical reports or referrals, 
which are intrinsically unstructured. The adoption of 
electronic medical or health records can significantly 
enhance the availability and sharing of clinical data and 
information, which are still kept only on paper in many 
healthcare sites. Yet, the digitalization of health data 
alone is far from sufficient; having clinical reports and 
referrals in PDF format is evidently not enough to solve 
the information extraction and question answering 
issues. A standard data and information representation 
according to a shared reference model has to be adopted, 
together with controlled terminologies and ontologies to 
objectively describe medical and biomolecular findings. 
Moreover, the use of advanced Natural Language Proces- 
sing techniques suited for the clinical domain to extract 
and structure information from previous medical textual 
descriptions can also greatly help. 

Second, typical biomedical-molecular questions are generally 
more complex than Web search questions. They 
often involve more types of data, as well as topics that 
usually have several attributes. In many cases, retrieving only a 
few of the items related to a biomedical-molecular search 
question, or even the K top items according to some 
user-defined ranking, may not be enough for a proper 
answer, which can instead require the exploration of all 
available items and their attributes. 

Advanced search computing techniques are being 
developed to answer complex, multi-topic Web search 
questions involving the integration of possibly ranked 
partial search results [17]. These techniques can also be 
applied in the CBI domain to tackle such issues, at least 
partially. Yet, the complex and heterogeneous nature of 
the biomedical data, as well as the multifaceted struc- 
ture of the clinical settings, pose formidable technologi- 
cal and organizational challenges for the effective 
management and use of biomedical-molecular data. In 
particular, integrated search and retrieval of bio-data, 
and their comprehensive analysis towards extraction of 
relevant information [18] and inference of biomedical 
knowledge, constitute some of the major challenges for 
the present and future of CBI, with a potentially remarkable 
impact on the advancement of clinical research and 
patient treatment. 
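
The idea of integrating ranked partial search results can be illustrated with a deliberately naive join of two ranked lists. This sketch is not one of the search computing techniques of [17]: it simply joins all items on a shared key, adds their scores and sorts, and the gene names and scores are invented. The optional k parameter reflects the point made above that, in CBI, the full joined result may be needed rather than only the top items.

```python
def join_ranked(results_a, results_b, k=None):
    """Join two ranked result lists on their key and re-rank.

    Each input is a list of (key, score) pairs. Only keys present in
    both lists survive the join; the combined score is the plain sum
    (an assumption of this sketch). With k=None all joined items are
    returned, otherwise only the k best.
    """
    scores_b = dict(results_b)
    joined = [(key, score_a + scores_b[key])
              for key, score_a in results_a if key in scores_b]
    joined.sort(key=lambda item: (-item[1], item[0]))
    return joined if k is None else joined[:k]


# Hypothetical partial results from two independent sources.
disease_association = [("EGFR", 0.75), ("KRAS", 0.5), ("TP53", 0.25)]
drug_relevance = [("TP53", 0.75), ("EGFR", 0.5), ("BRAF", 0.25)]

all_hits = join_ranked(disease_association, drug_relevance)
top_hit = join_ranked(disease_association, drug_relevance, k=1)
```

Note that an item ranked last in one source can still rank highly after the join, which is why truncating each partial result too early can lose relevant answers.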

Conclusions 

CBI goals are ambitious, but many factors, from the 
availability of high-throughput "-omics" technologies 
and equipment, which allow the identification of "individual" 
genomes and proteomes, to the remarkable increase in data 
storage and computational power offered by 
the most recent ICT systems, justify research and efforts in 
this domain. 

In this paper, we have reported some points of view 
on the current and future challenges in this domain that 
were discussed in a panel session at the NETTAB 2011 
workshop on Clinical Bioinformatics. 

First, we presented what we believe are the most 
important design features of a CBI-ICT infrastructure, 
by taking into account some achievements of the i2b2 
system. Data warehousing is essential in CBI because of 
the great amount of clinical and biomedical information 
that needs to be generated and managed within health 
organizations. Indeed, CBI depends on information that 
is gathered from single individuals, usually patients, and 
thus it cannot exclusively depend on general population 
or species oriented databases that are available on-line 
from main data providers. On the contrary, these general 
resources may only be used as a reference, 
while the most important data is provided by the 
clinical and molecular information of individuals. 

Some of the most relevant features of a data warehouse 
for CBI have been identified by examining the 
i2b2 experience. Simplicity of the database schema is a 
key factor: it facilitates the modularity and flexibility of 
the system, which support its continuous development 
and improvement, and it makes optimization of queries 
and searches possible. Modularity is indeed essential 
also because of the various and heterogeneous data and 
sources that may be usefully included in the data warehouse, 
thus leading to a multiplicity of goals and application 
domains for the system. 

The open source approach is also extremely important, 
since it is able to fully exploit collaboration among 
users both for software development and for the design 
of new features. Collaborative development is especially 
important for the maintenance and update of shared 
metadata, which, being the main means by which users 
interact with the database, in fact determine 
its real usefulness and success. 

Of course, individuals move, and information is 
accumulated in many health organizations and 
information systems that need to interoperate so that all 
possible information on each patient is made swiftly available 
when it is needed. This is a precondition for 
clinicians to be able to deal, at the point of care, with 
all the information needed for a proper, molecular-enabled 
diagnosis, prognosis and optimized treatment selection. 

Moreover, CBI data analysis may be greatly facilitated 
and improved when the population under analysis is as 
large as possible. Clinically related processes such as biomarker 
discovery and identification of genotype/phenotype 
correlations may only be carried out when a 
sufficient amount of data is available. Interoperation 
of information systems should therefore support both the integration 
of data on single individuals and the aggregation of data 
from many patients, and is hence of extreme relevance. 

In this paper, we have therefore also addressed the 
interoperability issue, and we have discussed some of 
the most recent standards for data modelling and data 
interchange and their possible use in CBI to overcome 
heterogeneity of original data and knowledge formats, as 
well as modelling of arising information. In this case, 
international standards exist and should be adopted, 
with the proviso that new evidence arising as a result 
of genomic medicine efforts also be properly included. 
We also highlighted that the application of a shared 
reference model could lead to a standard, semantically 
rich, processable representation of clinical statements. 

CBI is not limited, however, to the analysis of the information 
on a given individual by the health care personnel 
that provide him/her assistance. As previously said, it 
must also support researchers in the reuse of clinical data 
for research purposes. Many new applications are being 
developed by researchers in the field, who can largely 
benefit from accessing and searching information 
resources through Web services. It is often from such 
free access to data that new associations may be identified, 
possibly leading to hypotheses for the validation and 
assessment of new biomarkers. 

In this paper, we have therefore also tried to point out 
the main current difficulties in accessing 
and searching CBI related information sources. 

First, we addressed the idea that the adoption of com- 
mon data models could be the best starting point for 
the implementation of a set of data marts, optimized 
representations of data included in warehouses for per- 
forming specific research tasks. New data marts, each 
devoted to a different task, could easily be created and 
made available. 

Current techniques and technologies aimed at search- 
ing data on the Web, even the most advanced, do not 
seem completely adequate for CBI needs, where queries 
are complex, involve many data sources simultaneously 
and often require from each resource more 
results than the first few that are usually returned. One of 
the most demanding issues remains access to a lot of 
information that is included, and subjectively described, 
in free texts, which are intrinsically unstructured. The 
use of controlled terminologies and of ontologies, when- 
ever possible, together with the adoption of NLP tools 
suited for the clinical and biological domains can indeed 
support extraction of information from medical textual 
descriptions. 

Finally, we turned to the issue of searching and 
extracting information from large amounts of data. Queries 
that are relevant in CBI often require retrieving larger result 
sets than usual and exploring all available 
items and their attributes, because possible correlations 
among data in the results could change, 
even significantly, their relevance to the overall query. Of 
particular interest are those advanced search computing 
techniques, which are aimed at integrating ranked 
search results from multiple sources. 

The integrated search and retrieval of CBI data from 
multiple sources and its comprehensive analysis consti- 
tute in our opinion one of the biggest challenges for the 
future. The NETTAB 2013 workshop will be devoted to 
this theme. 